This study assessed the relationships between characteristics of biographical items
from the Armed Services Applicant Profile and the items’ validity in predicting the 
retention of enlisted military personnel. Item characteristics were appraised with 
ratings by expert judges and test takers, word and alternative counts, and response 
latencies. Item content was also appraised with ratings by expert judges. The more 
valid items involved overt behavior or experiences, dealt with discrete behavior or 
experiences, and had heterogeneous content. After controlling for item content, only
the last of these characteristics, heterogeneous content, was related to validity. Item characteristics and item content
interacted in several instances. 


Despite the long history and wide use of biographical inventories (see recent 
reviews by Breaugh, 2009; Stokes & Cooper, 2003; Stokes, Mumford, & Owens, 
1994), little is known about the characteristics of valid items. Several well-known 
guides for writing biographical items exist (Asher, 1972; Mael, 1991; Mumford 
& Owens, 1987; Mumford & Stokes, 1992; Owens, 1976; Owens, Glennon, 
& Albright, 1962), but the empirical underpinnings for their prescriptions are 
limited. 
Several studies appraised the validity of item characteristics against behavioral
criteria. Three of these investigations used all or most of the 10 item charac- 
teristics in Mael’s (1991) taxonomy of biographical items (Graham, McDaniel, 
Douglas, & Snell, 2002; Lefkowitz, Gebbia, Balsam, & Dunn, 1999; McManus & 
Masztal, 1999). Agreement in the findings was modest: valid items were verifi- 
able (Graham et al., 2002; McManus & Masztal, 1999), not first-hand (they did 
not ask for the respondent’s own evaluation of his or her performance or attitudes; 
Graham et al., 2002; Lefkowitz et al., 1999), and not controllable (they concerned
the respondent’s physical and social characteristics or actions taken by others; 
Graham et al., 2002; Lefkowitz et al., 1999). 

Another study (Barge, 1987, 1988) assessed three item characteristics, includ- 
ing one in the Mael taxonomy: the valid items were discrete (they involved a single 
unique act or simple count of unique events), in agreement with the McManus and 
Masztal findings. Valid items also concerned samples rather than signs of behavior 
(Wernimont & Campbell, 1968) and had homogeneous content. 

It is noteworthy that only the McManus and Masztal investigation assessed 
validity in a high-stakes situation: the biographical inventory was used in select- 
ing applicants for employment. This distinction may be important because of 
the possibility that test takers distort their responses on self-report measures in 
such a situation, affecting the validity of items and their associations with item 
characteristics. The susceptibility of biographical inventories to distortion is well 
established, though the consequences for validity are uncertain (see the review by 
Lautenschlager, 1994). Faking good in research studies and distortion in high- 
stakes settings are distinguishable, but it is nevertheless instructive that in the 
Graham et al. study all associations of item characteristics with the items’ validity 
observed when participants were instructed to answer honestly disappeared when 
they were asked to fake good. 

One issue not addressed thus far in this itemmetric research is the potential con- 
founding of the items’ characteristics and their content (Barge, 1988; Lefkowitz 
et al., 1999). The pools of items in these studies were heterogeneous in their con- 
tent, raising the possibility that, for example, discrete items are more valid than 
nondiscrete items, because discrete items happen to concern school achievement, 
and items about school achievement are more valid than items with other content. 
A related possibility is that item characteristics and content interact. For example,
discrete items may be more valid than nondiscrete items when the content is school
achievement but not when the content is something else.

In view of the limited work on the validity of biographical items, especially in 
high-stakes settings, and the uncertain influence of item content on the previous 
findings, the aim of this study was to assess the link between a comprehensive 
set of potentially important characteristics of biographical items and the items’ 
empirical validity in a selection situation, controlling for the items’ content and 
assessing the interaction between content and item characteristics. 
METHOD


Overview 


The biographical items were 120 items from a larger pool assembled for the 
Armed Services Applicant Profile (ASAP; Trent, 1993). The ASAP, designed to 
predict the adaptability of enlisted personnel to military service, is a traditional 
biographical inventory made up of a heterogeneous collection of items chosen for 
their potential relevance to adjustment and empirically keyed. The item content 
encompasses physical involvement, school achievement, delinquency, work ethic, 
independence, and social adaptation. The items are in a multiple-choice format, 
with three to five alternatives. Fifty-item forms of the ASAP correlated .29 with 
retention at the end of enlistment—usually 48 months. 

The item characteristics were selected on the basis of previous studies and 
commentaries about the characteristics of biographical and personality items 
(Angleitner, John, & Lohr, 1986; Asher, 1972; Barge, 1987, 1988; Blaney, 
1991; Goldberg, 1968; Holden & Fekken, 1990; Holden, Fekken, & Jackson, 
1985; Johnson, 2004; Mael, 1991; Owens et al., 1962; Werner & Pervin, 1986; 
Wiggins & Goldberg, 1965). A large number of characteristics were initially 
identified. They were subsequently winnowed down by eliminating those that 
overlapped with each other or were inapplicable to biographical items, in general, 
or to the ASAP items, in particular. The final set of characteristics was measured 
by experts’ or test takers’ ratings, word and alternative counts, and test takers’ 
response latencies. 

The item content was assessed by experts. The six content areas, noted ear- 
lier, had been identified in a factor analysis of a subset of the ASAP items 
(Trent, 1993). 

Each item’s validity was indexed by its association with the retention of military recruits.


Expert Ratings of Item Characteristics 


Four item characteristics were assessed by raters with graduate training or PhDs 
in psychology: Overt Behavior or Experience (two raters), Discrete Behavior 
or Experience (two raters), Homogeneous vs. Heterogeneous Content (eight 
raters), and Face Validity (eight raters). Each rater assessed a single characteristic.
Characteristics were rated on a three-point scale: Definitely, Somewhat or
Uncertain, Not at All (scored 3, 2, 1, respectively). Each item’s mean rating on 
a characteristic was used in the analysis. 


Overt behavior or experience. This variable was suggested by Asher 
(1972). The rating devised for this study was: “Describes overt behavior or 
experience (e.g., took a driver’s education course, had a flat tire) rather than an
internal state of mind (e.g., likes to drive).” 


Discrete behavior or experience. This variable was used by Barge (1987, 
1988), and the rating was adapted from his own: “Requires test takers to report 
discrete overt behavior or experience (a single instance of behavior or a single 
experience [e.g., date of last auto accident] or a simple count of instances of the 
behavior or experience [e.g., number of auto accidents]) rather than a subjective 
summary of overt behavior or experience (e.g., average miles driven per week) or 
a global evaluation of a trait (e.g., self-rating of driving ability).” 


Homogeneous vs. heterogeneous content. This variable was used by 
Barge, and the rating was adapted from his own: “Describes something (overt 
behavior or experience, or internal state of mind) that reflects a single trait 
(e.g., never being absent from school reflects dependability; worrying a lot reflects 
anxiety) rather than something that reflects several traits (e.g., being on the dean’s 
list in school reflects intelligence, motivation, etc.; being shy reflects lack of 
confidence, inadequate social skills, etc.).” 


Face validity. This variable was used by Holden and Fekken (1990). The 
rating devised for this study was: “Obviously relevant in assessing the adaptability 
of recruits to military service.” 


Other Item Characteristics 


Ambiguity. This variable was used by Gordon (1953). The rating devised 
for this study was: “Considering everything about the question (including its 
answers), how clear was it?” It was rated on a five-point scale: Extremely Clear, 
Very Clear, Somewhat Clear, Slightly Clear, Not Clear at All (scored 1, 2, 3, 4, 5, 
respectively). The raters were 137 Navy recruits in a research study. Each item’s 
mean rating was used in the analysis. 


Number of words. This variable was used by Holden and Fekken (1990). It 
is the total number of words in the item’s stem and alternatives. 


Number of alternatives. This is the number of alternatives for the item. 


Response latency. This variable was used by Holden and Fekken (1990). 
It is the time (in hundredths of a second) between when the item was presented in 
a computer administration and when the response was made. The latencies were 
obtained from 1,090 Navy recruits in a research study (Stricker & Alderton, 1999). 
Each item’s median time was used in the analysis. 
Expert Ratings of Content


The six content areas were assessed by three PhD-level psychologists with train- 
ing in personality. The factors’ labels and content (Trent, 1993) were used to 
define the content areas, with only minor changes for clarity (e.g., the School 
Achievement factor was relabeled School Involvement). Raters classified the 
items into one of the six content areas (or an “other” category), using a sorting 
procedure (Stricker & Rock, 1998). Each item’s consensus classification (the one 
chosen by at least two raters) was used in the analysis. (Items without a consen- 
sus classification were included in the “other” classification in the analysis.) The 
instructions follow: 


Please read each item (including its alternatives) and decide whether it appears to 
measure one of six factors: Physical Involvement, School Involvement, Delinquency, 
Work Ethic, Independence, and Social Adaptation. The factors were identified in 
factor analyses of some of these items. This is a summary of the items defining the 
factors: 

Physical Involvement. School athletic team membership, extent of athletic activi- 
ties, quality of athletic performance, preference for white-collar or blue-collar work, 
physical demands of military training, and childhood happiness. 

School Involvement. School grades, failing courses, skipping or failing grades, 
school course subjects, school club participation, attitudes toward school and teach- 
ers, and college aspirations; disciplinary actions, suspensions, expulsions, and 
authorized or unauthorized absence. 

Delinquency. Drinking, smoking, running away from home, troublemaking, and 
police/arrest involvement. 

Work Ethic. Employment status, quality and duration of employment, and job 
preference. 

Independence. Social independence, economic self-sufficiency, independent 
friends, motivation level, age, number of full-time jobs, fired from a job, and tattoos. 

Social Adaptation. Social alienation, traditional values, sociability, risk-taking, 
autonomy from parents, dominance, problem solving, flexibility, and sickness. 

If the item appears to be primarily a measure of a particular factor, put it in the 
pile for that factor. If the item does not appear to be primarily a measure of any of 
the factors or appears to measure two or more of the factors more-or-less equally 
well, put it in the “Other” pile. 
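The consensus rule described above (the classification chosen by at least two of the three raters, with items lacking a consensus assigned to “other”) can be illustrated with a minimal sketch. The function name, item identifiers, and ratings below are hypothetical and are not taken from the study.

```python
from collections import Counter

def consensus_category(labels, min_agree=2, fallback="Other"):
    """Return the label chosen by at least `min_agree` raters, else `fallback`."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else fallback

# Hypothetical sortings by three raters (illustrative data only).
ratings = {
    "item_01": ["Delinquency", "Delinquency", "Social Adaptation"],
    "item_02": ["Work Ethic", "Independence", "Other"],
    "item_03": ["School Involvement", "School Involvement", "School Involvement"],
}

consensus = {item: consensus_category(labels) for item, labels in ratings.items()}
print(consensus)
```

With three raters and a threshold of two, at most one label can reach the threshold, so the most frequent label is unambiguous whenever a consensus exists.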


Validity 


This is the association between the item and the criterion, retention at the end 
of 21 months of service in two samples of military recruits for the four services. 
The recruits took experimental forms of the ASAP, administered with instructions 
that the inventory was being used for selection, when they applied for enlistment 
(Trent, 1993).¹ Sample 1 (N = 13,501, 79% retained) and Sample 2 (N = 13,093,
also 79% retained) took different 130-item forms, including the 120 items in
this study (64 of the latter were common to both forms). Cramer’s V (Blalock, 
1979; Hays, 1994) was computed for each item from the contingency table for 
the item’s alternatives (3 to 5) and the dichotomous criterion (retention-attrition). 
(V is a generalization of the phi coefficient for 2 × 2 contingency tables to larger
tables, 3 × 2 to 5 × 2 in this study; its values range from 0 to 1, and it is
nondirectional.) V was computed separately for each sample. Each item’s V was
used in the analysis. (In cases where an item was administered to both samples, 
the mean V was used.) 
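As an illustration of the validity index, the following sketch computes Cramér’s V from an alternatives-by-criterion contingency table using the standard formula V = sqrt(χ² / (N · min(r − 1, c − 1))). The counts are hypothetical, and the function assumes that no row or column of the table is empty.

```python
import math

def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts.

    Assumes every row and column total is greater than zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

# Hypothetical 3 x 2 table: rows are item alternatives,
# columns are (retained, not retained). Illustrative counts only.
table = [[400, 100], [300, 100], [90, 10]]
v = cramers_v(table)
print(round(v, 3))
```

Because the criterion is dichotomous, min(r − 1, c − 1) equals 1 for every item, so V reduces to sqrt(χ² / N) regardless of the number of alternatives.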


Analysis 


The interrater reliability of the expert and recruit ratings of the item characteristics 
was estimated by the intraclass correlation (Shrout & Fleiss, 1979, Case 1, for 
mean ratings). The reliability of the experts’ content classifications was estimated 
by their mean Kappa (Conger, 1980). The reliability of the item validity index 
was estimated from the product-moment correlation between the indices for the 
64 common items in the two samples. 
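The intraclass correlation for mean ratings under Shrout and Fleiss’s Case 1 can be written as ICC(1,k) = (BMS − WMS) / BMS, where BMS and WMS are the between-item and within-item mean squares from a one-way analysis of variance. The sketch below applies that formula to hypothetical ratings; the function name and data are illustrative only.

```python
def icc_1k(ratings):
    """ICC(1,k): reliability of the mean of k ratings, one-way random model.

    `ratings` is a list with one entry per item; each entry is a list
    of k ratings of that item.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    item_means = [sum(r) / k for r in ratings]
    # Between-item mean square
    bms = k * sum((m - grand_mean) ** 2 for m in item_means) / (n - 1)
    # Within-item mean square
    wms = sum((x - m) ** 2
              for r, m in zip(ratings, item_means)
              for x in r) / (n * (k - 1))
    return (bms - wms) / bms

# Hypothetical three-point ratings of five items by two raters.
example = [[3, 3], [1, 2], [2, 2], [3, 2], [1, 1]]
icc = icc_1k(example)
print(round(icc, 2))
```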

Product-moment intercorrelations among the item characteristics, item con- 
tent variables, and item validity index were computed. Semipartial correlations 
were computed between each of the item characteristics and the validity index, 
partialing the set of six substantive content variables (dummy coded) out of the 
item characteristic. (The “other” content variable was not used in order to avoid 
collinearity among the content variables.) The interaction between each of the 
item characteristics and the set of six content variables vis-a-vis the item valid- 
ity index was assessed by hierarchical multiple regression analyses. For statistical 
significance, the .05 alpha level was used. For practical significance, a correlation
of .10 and a semipartial correlation (sr) of .14 were used; these are “small” effect
sizes, accounting for 1% and 2% of the variance, respectively (Cohen, 1988).
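The semipartial correlation described above can be sketched as follows: the content variables are partialed out of the item characteristic only, and the residual is then correlated with the validity index. Because the content categories are mutually exclusive, regressing a characteristic on the dummy-coded categories (with an intercept) yields fitted values equal to the category means, so the residual is simply the deviation from the category mean. The data and function names are hypothetical.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residualize_on_category(x, category):
    """Residuals of x after regression on dummy-coded category membership."""
    groups = defaultdict(list)
    for value, cat in zip(x, category):
        groups[cat].append(value)
    means = {cat: sum(vals) / len(vals) for cat, vals in groups.items()}
    return [value - means[cat] for value, cat in zip(x, category)]

def semipartial_r(characteristic, validity, category):
    """Correlate validity with the characteristic after removing content from it."""
    return pearson(residualize_on_category(characteristic, category), validity)

# Hypothetical data for six items (illustrative only).
category = ["School", "School", "School", "Other", "Other", "Other"]
characteristic = [1, 2, 3, 3, 4, 5]
validity = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07]

r_zero_order = pearson(characteristic, validity)
sr = semipartial_r(characteristic, validity, category)
print(round(r_zero_order, 3), round(sr, 3))
```

In this toy example the zero-order correlation is much larger than the semipartial correlation because the characteristic is confounded with content. The practical-significance thresholds follow from squaring: .10² = .01 and .14² ≈ .02.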


RESULTS 


Reliability of Variables 


The interrater reliability of the ratings is reported in Table 1. The reliability of the
expert ratings of item characteristics ranged from .78 for Face Validity to .69 for
Homogeneous vs. Heterogeneous Content, and was .86 for the recruits’ rating of
Ambiguity. The mean Kappa was .47 for the experts’ content classifications.² The
validity indices for the two samples correlated .94.


TABLE 1
Interrater Reliability of Item Characteristic Ratings

Characteristic                             N     ICCᵃ

Overt behavior or experience               2     .76
Discrete behavior or experience            2     .77
Homogeneous vs. heterogeneous content      8     .69
Face validity                              8     .78
Ambiguity                                137     .86

ᵃIntraclass correlation.


¹The instructions were: “Responses to this questionnaire will be assessed to determine applicants’
suitability for military service. Applicants are not required to provide this information; however, failure
to do so could affect an applicant’s chances of being selected for service. An overall score on the
questionnaire may become part of your permanent military record.”
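For readers unfamiliar with the agreement index, the following sketch computes Cohen’s kappa for each pair of raters and averages the results. It assumes that the “mean Kappa” reported for the content classifications is the average over all rater pairs, which may not match the exact procedure used; the classifications shown are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two raters' nominal classifications of the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from the two raters' marginal proportions
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(raters):
    """Average Cohen's kappa over all pairs of raters (an assumed definition)."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical classifications of four items by three raters.
rater_1 = ["School", "School", "Delinquency", "Delinquency"]
rater_2 = ["School", "School", "Delinquency", "Work Ethic"]
rater_3 = ["School", "Delinquency", "Delinquency", "Delinquency"]

mean_kappa = mean_pairwise_kappa([rater_1, rater_2, rater_3])
print(round(mean_kappa, 3))
```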


Correlations of Item Characteristics With Item Validity 


The intercorrelations of the item characteristics, item content variables, and item 
validity index, and the semipartial correlations of the item characteristics with the 
item validity index are reported in Table 2. Three characteristics, all expert rat- 
ings, correlated significantly with item validity: Homogeneous vs. Heterogeneous 
Content (r = −.32), Discrete Behavior or Experience (r = .29), and Overt
Behavior or Experience (r = .27). That is, the more valid items had heterogeneous
content, dealt with discrete behavior or experiences, and involved overt behavior 
or experiences. However, when item content was partialed out, only one character- 
istic correlated significantly with item validity, Homogeneous vs. Heterogeneous 
Content (sr = −.24): the more valid items had heterogeneous content.


Interactions of Item Characteristics and Item Content With Item Validity 


The hierarchical regression analyses for each item characteristic and the set of six 
content variables are reported in Table 3. Interactions with item content were significant
for two variables: Ambiguity (sr = .37) and Response Latency (sr = .33).³
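The logic of the hierarchical analysis can be sketched for a single dummy-coded content variable: content is entered first, the item characteristic second, and their product third, with the increment in R² recorded at each step. The sketch takes sr as the square root of the step-3 increment, which is consistent with the reported values (e.g., an increment of .09 alongside sr = .30). The data are a hypothetical toy example, and the small ordinary least squares routine is written out only to keep the example self-contained.

```python
import math

def r_squared(columns, y):
    """R^2 from an OLS fit of y on the predictor columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, held as an augmented matrix
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def hierarchical_steps(content, characteristic, validity):
    """Return the R^2 increments for steps 1, 2, and 3."""
    interaction = [d * x for d, x in zip(content, characteristic)]
    r2_1 = r_squared([content], validity)
    r2_2 = r_squared([content, characteristic], validity)
    r2_3 = r_squared([content, characteristic, interaction], validity)
    return r2_1, r2_2 - r2_1, r2_3 - r2_2

# Hypothetical toy data: the characteristic is negatively related to
# validity for items in the content area (content = 1) and unrelated
# for the remaining items (content = 0).
content = [1, 1, 1, 1, 0, 0, 0, 0]
characteristic = [1, 2, 3, 4, 1, 2, 3, 4]
validity = [0.08, 0.06, 0.04, 0.02, 0.03, 0.03, 0.03, 0.03]

step1, step2, step3 = hierarchical_steps(content, characteristic, validity)
sr_interaction = math.sqrt(step3)
print(round(step1, 3), round(step2, 3), round(step3, 3), round(sr_interaction, 3))
```

The toy data fit the full model exactly, which would not occur with real items; they are chosen only so that each increment can be checked by hand.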

In order to clarify these interactions, additional hierarchical regression analyses
were carried out for the individual content variables with significant interactions:
Delinquency in both analyses and School Involvement in the Ambiguity analysis.
These analyses are reported in Table 4. All interactions were again significant:
Delinquency with Ambiguity (sr = .26) and Response Latency (sr = .30),
and School Involvement with Ambiguity (sr = .20). In the interactions with
Delinquency, Ambiguity and Response Latency were negatively related to validity
for these items (B = −.00022 and −.00001, respectively) and were unrelated for
the other items (B = −.00001 and .00000, respectively). In the interaction with
School Involvement, Ambiguity was negatively related to validity for these items
(B = −.00015) and was unrelated for the other items (B = −.00001). In sum,
unlike items with other content, the more valid Delinquency items were unambiguous
and responded to quickly, and the more valid School Involvement items
were also unambiguous.


TABLE 4
Hierarchical Regression Analyses of Selected Item Characteristics and Individual Content
Variables With Item Validity

                                                           ΔR²

                                      Ambiguity and   Ambiguity and        Response Latency
Predictor                     df      Delinquency     School Involvement   and Delinquency

Step 1
  Item content variable       1,118   .12**           [illegible]          .12**
Step 2
  Item content variable       1,117   .01             .02                  .00
  and item characteristic
Step 3
  Item content variable,      1,116   .07**           .04*                 .09**
  item characteristic,
  and interaction

Note. N = 120. *p < .05, **p < .01.


²Eighty-two percent of the items were classified in the six substantive content areas: Physical
Involvement, N = 9; School Involvement, N = 24; Delinquency, N = 7; Work Ethic, N = 19;
Independence, N = 14; and Social Adaptation, N = 25.

³sr is the effect size for the interaction between the item characteristics and the item content
variables: the increase in the multiple correlation with the criterion when the interaction is added to the
multiple regression of the item characteristic and the content variables (Cohen, 1988; Cohen, Cohen,
West, & Aiken, 2002).


DISCUSSION 


The central finding of this study is that the links between biographical item char- 
acteristics and the items’ validity were both limited and complex: just a few 
characteristics played any role, and they were confounded or interacted with item 
content. 

It is remarkable that only three of the eight item characteristics evaluated were 
initially related to the items’ validity, and only one continued to be related when 
item content was controlled. Although just three interactions between the item 
characteristics and item content emerged, they had some consistency, clustering
around two item characteristics and two item content variables. 

The findings show some similarities to, and differences from, the few previous
results concerning the same item characteristics. In common with the two studies
of discreteness (Barge, 1987, 1988; McManus & Masztal, 1999), the present
investigation observed (before controlling for content) that discrete items were 
more valid than nondiscrete items. However, this relationship disappeared after 
controlling for content. 

In contrast to the results in the single investigation that assessed homogeneity
vs. heterogeneity and found that homogeneous items were more valid than
heterogeneous items (Barge, 1987, 1988), homogeneous items were less valid in
the present study. One speculation is that this divergent outcome stems from
differences in the criteria. The retention criterion in this study was multifaceted,
with many determinants, and may best be tapped by heterogeneous items. The
various criteria in the other investigation were more discrete and relatively less
complex, and perhaps better captured by homogeneous items.⁴ The
homogeneity-heterogeneity of content may be a very relevant feature of biographical
items, with its operation dependent on the nature of the validity criterion.

The confounding of item characteristics with content observed in this study, 
confirming previous concerns (Barge, 1987, 1988; Lefkowitz et al., 1999), and the 
interaction of these characteristics with content necessarily raise serious questions 
about the interpretability of findings from other itemmetric studies of the relations 
between characteristics of biographical items and the items’ validity (Barge, 1987, 
1988; Graham et al., 2002; Lefkowitz et al., 1999; McManus & Masztal, 1999).
None of these investigations controlled for content or assessed interactions with 
it. Further work in this line of research clearly needs to take content into account. 

Experiments that systematically manipulate items’ characteristics and con- 
tent may be the method of choice in such research (Barge, 1988). This 
tactic can ensure that all relevant characteristics and content are ade- 
quately represented, and the effect of each characteristic and each kind 
of content, free of any confounding, can be estimated directly and readily 
interpreted. 

Whatever the direction of research in this area, it is critical to distinguish 
between data that come from high-stakes operational settings and low-stakes
research settings, given the potential for distortion to influence the results (Graham 
et al., 2002). 

Although the content variables were simply employed as controls in this study,
their substantial effect on the results merits attention. This outcome runs counter
to the belief that subtlety, a claimed benefit of empirical keying, is a requisite
for validity in high-stakes settings (e.g., Hough & Paullin, 1994; Vasilopoulos &
Cucina, 2006). This finding about the importance of a small, defined set of content
variables also adds weight to the call for a construct-oriented approach to the
development of biographical measures (e.g., Mumford & Owens, 1987).


⁴For each job, training performance was assessed by each of several hands-on or job knowledge
tests, and job performance was measured by ratings on each of five dimensions. The validity was
estimated for each of the seven or eight components separately for each job.

The findings have some implications for devising or selecting valid biograph- 
ical items. The interaction with response latency lends support to the suggestion 
that this variable may be useful in selecting valid biographical items (Stricker & 
Alderton, 1999). And the interaction with ambiguity reinforces the standard 
practice of focusing on items that are clear. 

In interpreting the sparse findings, the source of the item pool should be considered.
The pool was not a random collection of newly minted items. Rather, the items
came originally from two inventories used operationally or in research, and they
were extensively screened for pertinence to military adjustment and freedom
from unfairness, intrusiveness, and the like. Item characteristics may also
have played some role, at least implicitly, in this process. For these reasons, really
egregious items are probably absent from the pool (see Trent, 1993).

An inevitable issue is the generalizability of these results. The item charac- 
teristics represent an array of major variables. The ASAP typifies traditional, 
empirically keyed biographical inventories. The data for the retention criterion
are close to optimal, given the extremely large samples and the recruits’ extended length
of service. And the criterion closely matches the ASAP’s purpose. But, of course,
there are other item characteristics, pools of biographical items, behavioral crite- 
ria for assessing their validity, and populations of test takers. Follow-up research 
is very much in order to confirm or disconfirm the nuanced contribution of 
item characteristics to the validity of biographical items that was observed in 
this study. 